Benjamin Franklin
Benjamin Franklin
There are going to be files
LOTS of files
The files will change over time
The files will have relationships to each other
It'll probably get complicated
File organization and naming is a mighty weapon against chaos
Make a file's name and location VERY INFORMATIVE about what it is, why it exists, how it relates to other things
The more things are self-explanatory, the better
READMEs are great, but don't document something if you could just make that thing self-documenting by definition
Pick a strategy, any strategy, just pick one!
data data-raw data-clean data/ - raw - clean
Pick a strategy, any strategy, just pick one!
code scripts analysis bin
Pick a strategy, any strategy, just pick one!
figures results results/ - figs - nums figures tables
/Users/jenny/research/bohlmann/White_Pine_Weevil_DE: total used in directory 246648 available 131544558 drwxr-xr-x 14 jenny staff 476 Jun 23 2014 . drwxr-xr-x 4 jenny staff 136 Jun 23 2014 .. -rw-r--r--@ 1 jenny staff 15364 Apr 23 10:19 .DS_Store -rw-r--r-- 1 jenny staff 126231190 Jun 23 2014 .RData -rw-r--r-- 1 jenny staff 19148 Jun 23 2014 .Rhistory drwxr-xr-x 3 jenny staff 102 May 16 2014 .Rproj.user drwxr-xr-x 17 jenny staff 578 Apr 29 10:20 .git -rw-r--r-- 1 jenny staff 50 May 30 2014 .gitignore -rw-r--r-- 1 jenny staff 1003 Jun 23 2014 README.md -rw-r--r-- 1 jenny staff 205 Jun 3 2014 White_Pine_Weevil_DE.Rproj drwxr-xr-x 20 jenny staff 680 Apr 14 15:44 analysis drwxr-xr-x 7 jenny staff 238 Jun 3 2014 data drwxr-xr-x 22 jenny staff 748 Jun 23 2014 model-exposition drwxr-xr-x 4 jenny staff 136 Jun 3 2014 results
Ready to analyze data:
Raw data:
R scripts + the Markdown files from "Compile Notebook":
The figures created in those R scripts and linked in those Markdown files:
Linear progression of R scripts, and Makefile to run the entire analysis:
Tab-delimited files with one row per gene of parameter estimates, test statistics, etc.:
Files to help collaborators understand the model we fit: some markdown docs, a Keynote presentation, Keynote slides exported as PNGs for viewability on GitHub:
This project is no where near done, i.e. no manuscript or publication-ready figs
File naming has inconsistencies due to three different people being involved
Code and reports/figures all sit together because it’s just much easier that way w/ knitr & rmarkdown
I can walk away from the project and come back to it a year later and resume work fairly quickly
The two other people (the post-doc whose project it is + the bioinformatician for that lab) were able to figure out what I did and decide which files they needed to look at, etc.
from_joe directoryLet's say my collaborator and data producer is Joe.
He will send me data with weird space-containing file names, data in Microsoft Excel workbooks, etc.
It is futile to fight this, just quarantine all the crazy here. I rename things and/or export to plain text and put those files in my data directory.
Whether I move, copy, or symlink depends on the situation.
Whatever I did gets recorded in a README or in comments in my R code – whatever makes it easiest for me to remind myself of a file's provenance, if it came from the outside world in a state that was not ready for programmatic analysis.
I often revoke my own write permission to the raw data file.
Then I can't accidentally edit it.
It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
Sometimes you need a place to park key emails, internal documentation and explanations, random Word and PowerPoint docs people send, etc.
This is kind of like from_joe, where I don’t force myself to keep same standards with respect to file names and open formats.
Here’s how most data analyses go down in reality:
You get raw data
You explore, describe and visualize it
You diagnose what this data needs to become useful
You fix, clean, marshal the data into ready-to-analyze form
You visualize it some more
You fit a model or whatever and write lots of numerical results to file
You make prettier tables and many figures based on the data & results accumulated by this point
Both the data file(s) and the code/scripts that acts on them reflect this progression
The R scripts:
01_marshal-data.r 02_pre-dea-filtering.r 03_dea-with-limma-voom.r 04_explore-dea-results.r 90_limma-model-term-name-fiasco.r
The figures left behind:
02_pre-dea-filtering-preDE-filtering.png 03-dea-with-limma-voom-voom-plot.png 04_explore-dea-results-focus-term-adjusted-p-values1.png 04_explore-dea-results-focus-term-adjusted-p-values2.png ... 90_limma-model-term-name-fiasco-first-voom.png 90_limma-model-term-name-fiasco-second-voom.png
File organization should reflect inputs vs outputs and the flow of information
/Users/jenny/research/bohlmann/White_Pine_Weevil_DE: drwxr-xr-x 20 jenny staff 680 Apr 14 15:44 analysis drwxr-xr-x 7 jenny staff 238 Jun 3 2014 data drwxr-xr-x 22 jenny staff 748 Jun 23 2014 model-exposition drwxr-xr-x 4 jenny staff 136 Jun 3 2014 results